# NY TAXI DATA SCIENCE FUN

### Basic Questions:
1. What are the distributions of the number of passengers per trip, payment type, fare amount, tip amount, and total amount?
2. What are the top 5 busiest hours of the day, and the top 10 busiest locations in the city?
3. What is the hourly taxi activity for each day of the week?
4. Which trip has the most consistent fares?
### Open Questions:
1. Can you predict the fare and tip amount based on the pickup / drop off location, time, and day of the week?
2. Can you predict the pickup / drop off geographical distribution for each hour of a weekday?
3. If you were a taxi owner, how would you maximize your earnings in a day?
4. If you run a taxi company, how would you maximize your earnings?
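
Basic Questions 2 and 3 both reduce to counting trips by pickup hour and by weekday. The sketch below shows the idea on a handful of made-up timestamps written in the same format as the `tpep_pickup_datetime` column; the sample values and variable names are illustrative assumptions, not results from the dataset.

```python
from collections import Counter
from datetime import datetime

# hypothetical sample of tpep_pickup_datetime strings in the CSV's format
pickups = [
    "2016-01-01 00:00:00",
    "2016-01-01 00:18:30",
    "2016-01-01 19:05:12",
    "2016-01-02 01:45:26",
    "2016-01-02 19:30:00",
    "2016-01-04 08:15:00",
]

stamps = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in pickups]

# Basic Question 2: trips per hour of day, busiest first
trips_per_hour = Counter(t.hour for t in stamps)
top_hours = trips_per_hour.most_common(5)

# Basic Question 3: trips per (weekday, hour); Monday is 0, Sunday is 6
trips_per_weekday_hour = Counter((t.weekday(), t.hour) for t in stamps)

print(top_hours)
print(sorted(trips_per_weekday_hour.items()))
```

On the full DataFrame the same counts could probably be obtained with `pd.to_datetime(df['tpep_pickup_datetime']).dt.hour.value_counts().head(5)`, and with a `groupby` on `.dt.dayofweek` and `.dt.hour` for the weekday breakdown.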

In [1]:
 
t=1
t
Out[1]:
1
In [7]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import plotly.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly import tools
#initiate the Plotly Notebook mode
init_notebook_mode()
df_big = pd.read_csv('yellow_tripdata_2016-01.csv')
# no cleaning is applied yet: the fillna/dropna variants below are left disabled
#df_big_clean=df_big.fillna(df_big.mean())#df_big.dropna(axis=1)
df_big_clean=df_big
# R-style filter kept for reference (| is the or-operator, ! inverts): keep rows that are neither NA nor ""
#df_big_clean <- df_big[!(is.na(df$start_pc) | df$start_pc==""), ]
df=df_big_clean.loc[0:10000,:]  # use a reduced set of rows for testing (.loc is inclusive, so 10001 rows)
print(df_big.shape)
print(df_big_clean.shape)
df
(2389990, 19)
(2389990, 19)
Out[7]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2016-01-01 00:00:00 2016-01-01 00:00:00 2 1.10 -73.990372 40.734695 1 N -73.981842 40.732407 2 7.5 0.5 0.5 0.00 0.0 0.3 8.80
1 2 2016-01-01 00:00:00 2016-01-01 00:00:00 5 4.90 -73.980782 40.729912 1 N -73.944473 40.716679 1 18.0 0.5 0.5 0.00 0.0 0.3 19.30
2 2 2016-01-01 00:00:00 2016-01-01 00:00:00 1 10.54 -73.984550 40.679565 1 N -73.950272 40.788925 1 33.0 0.5 0.5 0.00 0.0 0.3 34.30
3 2 2016-01-01 00:00:00 2016-01-01 00:00:00 1 4.75 -73.993469 40.718990 1 N -73.962242 40.657333 2 16.5 0.0 0.5 0.00 0.0 0.3 17.30
4 2 2016-01-01 00:00:00 2016-01-01 00:00:00 3 1.76 -73.960625 40.781330 1 N -73.977264 40.758514 2 8.0 0.0 0.5 0.00 0.0 0.3 8.80
5 2 2016-01-01 00:00:00 2016-01-01 00:18:30 2 5.52 -73.980118 40.743050 1 N -73.913490 40.763142 2 19.0 0.5 0.5 0.00 0.0 0.3 20.30
6 2 2016-01-01 00:00:00 2016-01-01 00:26:45 2 7.45 -73.994057 40.719990 1 N -73.966362 40.789871 2 26.0 0.5 0.5 0.00 0.0 0.3 27.30
7 1 2016-01-01 00:00:01 2016-01-01 00:11:55 1 1.20 -73.979424 40.744614 1 N -73.992035 40.753944 2 9.0 0.5 0.5 0.00 0.0 0.3 10.30
8 1 2016-01-01 00:00:02 2016-01-01 00:11:14 1 6.00 -73.947151 40.791046 1 N -73.920769 40.865578 2 18.0 0.5 0.5 0.00 0.0 0.3 19.30
9 2 2016-01-01 00:00:02 2016-01-01 00:11:08 1 3.21 -73.998344 40.723896 1 N -73.995850 40.688400 2 11.5 0.5 0.5 0.00 0.0 0.3 12.80
10 2 2016-01-01 00:00:03 2016-01-01 00:06:19 1 0.79 -74.006149 40.744919 1 N -73.993797 40.741440 2 6.0 0.5 0.5 0.00 0.0 0.3 7.30
11 2 2016-01-01 00:00:03 2016-01-01 00:15:49 6 2.43 -73.969330 40.763538 1 N -73.995689 40.744251 1 12.0 0.5 0.5 3.99 0.0 0.3 17.29
12 2 2016-01-01 00:00:03 2016-01-01 00:00:11 4 0.01 -73.989021 40.721539 1 N -73.988960 40.721699 2 2.5 0.5 0.5 0.00 0.0 0.3 3.80
13 1 2016-01-01 00:00:04 2016-01-01 00:14:32 1 3.70 -74.004303 40.742241 1 N -74.007362 40.706936 1 14.0 0.5 0.5 3.05 0.0 0.3 18.35
14 1 2016-01-01 00:00:05 2016-01-01 00:14:27 2 2.20 -73.991997 40.718578 1 N -74.005135 40.739944 1 11.0 0.5 0.5 1.50 0.0 0.3 13.80
15 2 2016-01-01 00:00:05 2016-01-01 00:07:17 1 0.54 -73.985161 40.768951 1 N -73.990227 40.761730 2 6.0 0.5 0.5 0.00 0.0 0.3 7.30
16 2 2016-01-01 00:00:05 2016-01-01 00:07:14 1 1.92 -73.973091 40.795361 1 N -73.978371 40.773151 2 7.5 0.5 0.5 0.00 0.0 0.3 8.80
17 1 2016-01-01 00:00:06 2016-01-01 00:04:44 1 1.70 -73.982101 40.774696 1 Y -73.970940 40.796707 1 7.0 0.5 0.5 1.65 0.0 0.3 9.95
18 2 2016-01-01 00:00:06 2016-01-01 00:07:14 1 1.38 -73.994843 40.718498 1 N -73.989807 40.734230 1 7.0 0.5 0.5 1.66 0.0 0.3 9.96
19 1 2016-01-01 00:00:07 2016-01-01 00:20:35 2 4.90 -73.953033 40.672115 1 N -73.986572 40.710594 1 19.0 0.5 0.5 4.06 0.0 0.3 24.36
20 1 2016-01-01 00:00:07 2016-01-01 00:09:49 1 1.80 -73.989166 40.726589 1 N -74.009483 40.715073 2 9.0 0.5 0.5 0.00 0.0 0.3 10.30
21 2 2016-01-01 00:00:08 2016-01-01 00:18:51 1 3.09 -73.999069 40.720173 1 N -73.973389 40.756561 2 14.5 0.5 0.5 0.00 0.0 0.3 15.80
22 2 2016-01-01 00:00:08 2016-01-01 00:04:37 1 0.72 -73.997139 40.747219 1 N -74.004486 40.751797 2 5.0 0.5 0.5 0.00 0.0 0.3 6.30
23 2 2016-01-01 00:00:08 2016-01-01 00:03:24 1 0.69 -73.997414 40.736675 1 N -73.985664 40.732681 2 4.5 0.5 0.5 0.00 0.0 0.3 5.80
24 1 2016-01-01 00:00:09 2016-01-01 00:19:03 3 5.30 -73.997131 40.736961 1 N -73.928421 40.755581 1 18.0 0.5 0.5 3.85 0.0 0.3 23.15
25 1 2016-01-01 00:00:09 2016-01-01 00:07:18 2 1.20 -73.963913 40.712173 1 N -73.951332 40.712200 2 7.0 0.5 0.5 0.00 0.0 0.3 8.30
26 2 2016-01-01 00:00:10 2016-01-01 00:06:15 2 0.97 -73.999397 40.743900 1 N -73.988876 40.745319 2 6.0 0.5 0.5 0.00 0.0 0.3 7.30
27 2 2016-01-01 00:00:10 2016-01-01 00:02:20 1 0.87 -73.954407 40.778069 1 N -73.948929 40.788582 2 4.5 0.5 0.5 0.00 0.0 0.3 5.80
28 2 2016-01-01 00:00:12 2016-01-01 00:01:17 1 0.13 -73.991653 40.754559 1 N -73.990601 40.756119 2 3.0 0.5 0.5 0.00 0.0 0.3 4.30
29 1 2016-01-01 00:00:14 2016-01-01 00:13:02 1 2.40 -73.995598 40.744240 1 N -73.985458 40.768711 1 11.0 0.5 0.5 3.05 0.0 0.3 15.35
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9971 2 2016-01-02 01:45:26 2016-01-02 01:51:52 2 2.21 -73.992805 40.747776 1 N -73.986519 40.771732 2 8.5 0.5 0.5 0.00 0.0 0.3 9.80
9972 1 2016-01-02 01:45:27 2016-01-02 01:48:02 1 0.60 -73.988205 40.759205 1 N -73.982246 40.767685 2 4.5 0.5 0.5 0.00 0.0 0.3 5.80
9973 1 2016-01-02 01:45:27 2016-01-02 02:00:46 1 4.60 -73.989395 40.760468 1 N -73.920860 40.743256 2 16.0 0.5 0.5 0.00 0.0 0.3 17.30
9974 1 2016-01-02 01:45:27 2016-01-02 02:17:22 3 12.90 -74.004295 40.707962 1 N -73.844490 40.722347 1 38.5 0.5 0.5 0.00 0.0 0.3 39.80
9975 1 2016-01-02 01:45:28 2016-01-02 02:03:06 1 6.10 -74.000961 40.731586 1 N -73.941544 40.800468 1 19.0 0.5 0.5 2.22 0.0 0.3 22.52
9976 1 2016-01-02 01:45:28 2016-01-02 01:51:40 1 1.20 -74.010986 40.710609 1 N -74.010986 40.710609 2 6.5 0.5 0.5 0.00 0.0 0.3 7.80
9977 1 2016-01-02 01:45:28 2016-01-02 01:54:08 1 1.60 -73.973106 40.758457 1 N -73.996124 40.760876 2 7.5 0.5 0.5 0.00 0.0 0.3 8.80
9978 2 2016-01-02 01:45:28 2016-01-02 01:56:31 3 3.13 -74.002403 40.718761 1 N -73.977814 40.745529 2 11.5 0.5 0.5 0.00 0.0 0.3 12.80
9979 2 2016-01-02 01:45:28 2016-01-02 01:58:02 1 3.09 -73.961632 40.764370 1 N -73.919220 40.755932 2 11.5 0.5 0.5 0.00 0.0 0.3 12.80
9980 2 2016-01-02 01:45:28 2016-01-02 01:49:02 1 0.91 -73.994820 40.721390 1 N -73.985573 40.727058 1 5.0 0.5 0.5 1.26 0.0 0.3 7.56
9981 1 2016-01-02 01:45:29 2016-01-02 01:54:47 3 2.30 -74.003494 40.741982 1 N -73.981689 40.764687 2 9.0 0.5 0.5 0.00 0.0 0.3 10.30
9982 1 2016-01-02 01:45:29 2016-01-02 01:54:09 1 2.00 -73.999969 40.728603 1 N -73.978676 40.744957 1 9.0 0.5 0.5 2.05 0.0 0.3 12.35
9983 2 2016-01-02 01:45:29 2016-01-02 01:57:43 1 4.89 -73.993301 40.720043 1 N -73.952782 40.742481 2 16.5 0.5 0.5 0.00 0.0 0.3 17.80
9984 2 2016-01-02 01:45:30 2016-01-02 01:58:14 5 3.81 -73.972496 40.677151 1 N -73.926888 40.668835 1 13.5 0.5 0.5 2.96 0.0 0.3 17.76
9985 1 2016-01-02 01:45:31 2016-01-02 01:56:48 1 2.40 -73.910530 40.744858 1 N -73.914238 40.759933 2 10.5 0.5 0.5 0.00 0.0 0.3 11.80
9986 1 2016-01-02 01:45:31 2016-01-02 01:49:36 1 0.60 -74.000542 40.729885 1 N -74.004311 40.722778 2 4.5 0.5 0.5 0.00 0.0 0.3 5.80
9987 2 2016-01-02 01:45:31 2016-01-02 01:47:39 1 0.43 -73.994431 40.727772 1 N -74.000458 40.727341 2 3.5 0.5 0.5 0.00 0.0 0.3 4.80
9988 2 2016-01-02 01:45:31 2016-01-02 02:04:54 1 7.40 -73.953514 40.775261 1 N -73.881020 40.755943 1 23.5 0.5 0.5 4.96 0.0 0.3 29.76
9989 2 2016-01-02 01:45:31 2016-01-02 01:55:57 1 4.23 -73.981857 40.746017 1 N -73.943085 40.795063 2 13.5 0.5 0.5 0.00 0.0 0.3 14.80
9990 2 2016-01-02 01:45:31 2016-01-02 01:50:24 1 0.94 -73.983353 40.729210 1 N -73.983353 40.729210 1 5.5 0.5 0.5 1.36 0.0 0.3 8.16
9991 1 2016-01-02 01:45:32 2016-01-02 02:03:24 1 5.90 -73.954666 40.821003 1 N -73.954666 40.821003 1 18.5 0.5 0.5 5.94 0.0 0.3 25.74
9992 2 2016-01-02 01:45:32 2016-01-02 01:55:26 1 2.83 -73.985641 40.763119 1 N -74.001694 40.732391 1 10.5 0.5 0.5 2.36 0.0 0.3 14.16
9993 2 2016-01-02 01:45:32 2016-01-02 02:02:24 2 6.31 -73.972076 40.754040 1 N -73.869659 40.749451 2 19.5 0.5 0.5 0.00 0.0 0.3 20.80
9994 2 2016-01-02 01:45:32 2016-01-02 01:52:22 1 1.65 -73.992012 40.725880 1 N -74.009697 40.709923 1 7.0 0.5 0.5 1.00 0.0 0.3 9.30
9995 2 2016-01-02 01:45:33 2016-01-02 01:54:45 1 2.05 -73.989403 40.750538 1 N -74.003639 40.725395 2 9.0 0.5 0.5 0.00 0.0 0.3 10.30
9996 2 2016-01-02 01:45:34 2016-01-02 01:59:03 1 2.84 -73.974426 40.790932 1 N -73.940430 40.822159 1 12.5 0.5 0.5 3.45 0.0 0.3 17.25
9997 2 2016-01-02 01:45:34 2016-01-02 01:55:11 1 3.45 -73.989151 40.726864 1 N -73.958389 40.765392 1 11.5 0.5 0.5 1.00 0.0 0.3 13.80
9998 2 2016-01-02 01:45:35 2016-01-02 01:52:43 1 1.30 -73.968239 40.755379 1 N -73.956322 40.768002 1 7.0 0.5 0.5 1.70 0.0 0.3 10.00
9999 1 2016-01-02 01:45:37 2016-01-02 01:50:31 1 1.20 -73.982224 40.768620 1 N -73.983765 40.779598 1 6.0 0.5 0.5 2.00 0.0 0.3 9.30
10000 2 2016-01-02 01:45:37 2016-01-02 01:59:47 3 2.69 -73.960518 40.710976 1 N -73.925240 40.698357 2 12.0 0.5 0.5 0.00 0.0 0.3 13.30

10001 rows × 19 columns

In [8]:
 
#help(plotly.offline.iplot)
 
## Insight 1: Passenger numbers
 * Most NY Taxi trips transport solo passengers

In [9]:
import numpy as np
import plotly.plotly as py
#import plotly.offline as offline
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
import plotly.graph_objs as go
init_notebook_mode()
#extract number of people per trip
peps_per_trip_df=df.loc[:, df.columns.str.match('passenger_count')]
peps_per_trip_df.shape
#print(type(peps_per_trip_df))
peps_per_trip=df.loc[:, df.columns.str.match('passenger_count')].values
#print(type(peps_per_trip))
data = [go.Histogram(x=peps_per_trip)]  # or [dataset1, dataset2]
layout = go.Layout(
    title='Histogram of Passenger numbers',
    xaxis=dict(
        title='passenger number'
    ),
    yaxis=dict(
        title='Count'
    ),
    bargap=0.2,
    bargroupgap=0.1
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,  filename='People_per_trip_histogram') #this plots in online mode, limit of 50/day in community a/c
#iplot(fig,  filename='People_per_trip_histogram') #This plots when offline; no limit
High five! You successfuly sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~elmao/0 or inside your plot.ly account where it is named 'People_per_trip_histogram'
Out[9]:
## Insight 2: cash versus credit
* New Yorkers prefer to pay by credit card (roughly 60:40 credit card to cash)
* Cash usage is still considerable at about 40%. The cash option is a point of difference from competitor Uber.
* The distribution of fares is similar for cash and credit card payments (the median credit card fare is \$1 higher than the median cash fare)
* The peak at \$52 is likely to represent Manhattan -> JFK airport trips (these have a flat fare of \$52, source: Wikipedia)
* NY taxi fares are cheap (compared to Melbourne!): the median fare is around \$10
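
One way to check the JFK reading of the \$52 peak would be to cross-tabulate fares against `RatecodeID`, where code 2 is the JFK rate. The sketch below uses a few made-up `(RatecodeID, fare_amount)` pairs; the data and variable names are assumptions for illustration only.

```python
from collections import Counter

# hypothetical (RatecodeID, fare_amount) pairs; RatecodeID 2 is the JFK rate code
trips = [(1, 7.5), (1, 18.0), (2, 52.0), (1, 9.0),
         (2, 52.0), (1, 52.0), (2, 52.0), (1, 6.0)]

# fares charged under the JFK rate code
jfk_fares = [fare for rate, fare in trips if rate == 2]

# rate codes of all trips whose fare is exactly $52
rates_at_52 = [rate for rate, fare in trips if fare == 52.0]

# fraction of $52 fares that carry the JFK rate code
share_of_52_that_is_jfk = sum(1 for r in rates_at_52 if r == 2) / len(rates_at_52)

# most frequent fare among JFK-coded trips
most_common_jfk_fare, _ = Counter(jfk_fares).most_common(1)[0]

print(share_of_52_that_is_jfk, most_common_jfk_fare)
```

If most \$52 fares in the real data carry rate code 2, that would support the flat-fare explanation; a large share under code 1 would point to something else.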

In [10]:
# Distribution: Payment by type
# Add histogram data
# extract fares by payment type
# 1=cc, 2=cash, 3=no charge, 4=dispute, 5=unknown, 6=voided trip
fare_paymenttype1=df.loc[df['payment_type'] == 1, 'fare_amount'].values #credit card
fare_paymenttype2=df.loc[df['payment_type'] == 2, 'fare_amount'].values #cash
#fare_paymenttype4=df.loc[df['payment_type'] == 4, 'fare_amount'].values #dispute
fare_payments=np.append(fare_paymenttype1,fare_paymenttype2)
total_paymentstype1=df.loc[df['payment_type'] == 1, 'total_amount'].values   # fare + tips + tolls (credit card)
total_paymentstype2=df.loc[df['payment_type'] == 2, 'total_amount'].values   # fare + tips + tolls (cash)
tip_amountstype1=df.loc[df['payment_type'] == 1, 'tip_amount'].values   # tips (credit card only)
total_payments=np.append(total_paymentstype1,total_paymentstype2)
# count trips per payment type (count matching rows rather than summing the type codes)
numberofCCpays=(df['payment_type'] == 1).sum()
numberofCashpays=(df['payment_type'] == 2).sum()
PcentofCCpays=np.round(numberofCCpays*100/(numberofCashpays+numberofCCpays), decimals=1)
#print(PcentofCCpays)
PcentofCashpays=np.round(numberofCashpays*100/(numberofCashpays+numberofCCpays), decimals=1)
#print(PcentofCashpays)
#print(type(fare_paymenttype2[1:10]))
# Group data together
hist_data = [fare_paymenttype1,fare_paymenttype2]
find_median1=np.median(fare_paymenttype1)
find_median2=np.median(fare_paymenttype2)
#print(find_median)
group_labels = ['Credit card', 'Cash']
# Create distplot with custom bin_size
fig = ff.create_distplot(hist_data, group_labels, bin_size=1.0)
fig.layout.update({'title': 'Distribution of Fares'})
fig.layout.xaxis1.update({'title': '$ amounts'})
# Plot!
#py.iplot(fig, filename='Distplot with Multiple Datasets') #online plot mode
iplot(fig, filename='Distplot with Multiple Datasets') #offline mode
from IPython.display import display, Math, Latex
display(Math(r'\text{Percentage of credit card payments is } %s \text{%%}' % PcentofCCpays))
display(Math(r'\text{Median credit payment is \$} %s ' % find_median1))
display(Math(r'\text{Percentage of cash payments is  } %s \text{%%}' % PcentofCashpays))
display(Math(r'\text{Median cash payment is \$} %s' % find_median2))
[Figure: Distribution of Fares. Series: Cash, Credit card. x-axis: $ amounts]
Percentage of credit card payments is 60.8%
Median credit payment is $9.5
Percentage of cash payments is 39.2%
Median cash payment is $8.5
 
## Insight 3: fare breakdown
* The median tip (credit card data only) is about 20% of the median credit card fare (\$1.96 against \$9.50)
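
The tip percentage reported in this notebook is the median tip divided by the median fare. That is not necessarily the same as the median of the per-trip tip percentages, which is arguably closer to "what a typical passenger tips". The sketch below contrasts the two on a few illustrative `(fare_amount, tip_amount)` pairs; the data are assumptions, not a sample drawn from the analysis.

```python
from statistics import median

# illustrative (fare_amount, tip_amount) pairs for credit card trips
trips = [(12.0, 3.99), (14.0, 3.05), (11.0, 1.50),
         (7.0, 1.65), (19.0, 4.06), (52.0, 5.00)]

fares = [fare for fare, _ in trips]
tips = [tip for _, tip in trips]

# what the notebook computes: ratio of the two medians
ratio_of_medians = 100 * median(tips) / median(fares)

# alternative: tip percentage per trip, then the median of those percentages
median_of_ratios = median(100 * tip / fare for fare, tip in trips if fare > 0)

print(round(ratio_of_medians, 1), round(median_of_ratios, 1))
```

The two numbers can differ noticeably when long flat-fare trips carry proportionally smaller tips, so it is worth stating which one a headline figure refers to.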

In [27]:
# Group data together
hist_data2 = [fare_payments,total_payments,tip_amountstype1]
group_labels2 = ['Fare', 'Total Charge', 'Tip Amount']
# Create distplot with custom bin_size
fig2 = ff.create_distplot(hist_data2, group_labels2, bin_size=[0.5,0.5,0.4])
fig2.layout.update({'title': 'Breakdown & Distribution of NY Taxi Fares'})
fig2.layout.xaxis1.update({'title': '$ amounts'})
# Plot!
#py.iplot(fig2, filename='Distplot with Multiple Datasets2') # online plot option
iplot(fig2, filename='Distplot with Multiple Datasets2') # offline plot option
find_mediantip=np.median(tip_amountstype1)
Med_tip_percentage=np.round(find_mediantip*100/find_median1, decimals=1)
display(Math(r'\text{Median tip payment (Credit card payment data only) is \$} %s ' % find_mediantip))
display(Math(r'\text{Median tip percentage (Credit card payment data only) is } %s \text{%%}' % Med_tip_percentage))
[Figure: Breakdown & Distribution of NY Taxi Fares. Series: Tip Amount, Total Charge, Fare. x-axis: $ amounts]
Median tip payment (Credit card payment data only) is $1.96
Median tip percentage (Credit card payment data only) is 20.6%
 
## Insight 4: Pick up and Drop off locations
* Manhattan (central business zone) is the busiest area for taxi use
* Airports (LaGuardia and JFK) feature strongly in usage maps
    * Curiously, people are dropped off at the airports at very fixed locations, while pick-up locations are more diffuse
        * Is there a culture of people wandering out from the airport and hailing taxis from wherever they are, perhaps because taxi ranks are not easy to locate?


* People **start taxi journeys** most frequently:
    1. in Manhattan on the **main streets**
    2. on the **main arterial routes** within residential areas (Brooklyn, Queens)
        * The *Sex And The City* imagery of hailing taxis on demand from busy streets is backed up by the data


* People **end taxi journeys** most frequently:
    1. again in Manhattan, both on and off the main streets
    2. at very **diffuse locations** across residential areas (Brooklyn, Queens, The Bronx)
        * The Bronx is a frequent drop-off location, but rarely a pick-up location
            * An effect of green "boro taxis" since 2013? (Note, however, that the boroughs where green taxis can be hailed include The Bronx, Queens and Brooklyn, yet the Bronx taxi pattern is notably different)
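
To put numbers on "busiest area", coordinates can be binned into a regular grid and the trips per cell counted. The sketch below is one possible version, assuming 0.01 degree cells and a few sample pickup coordinates; the helper name `cell_centre` and the cell size are assumptions. It uses `floor` so that each label is the centre of the cell containing the point, whereas rounding to two decimals and then adding 0.005 would label the upper edge of the rounding interval instead.

```python
from collections import Counter
from math import floor

CELL = 0.01  # grid cell size in degrees (assumed)

def cell_centre(lat, lon, cell=CELL):
    """Return the centre of the grid cell that contains (lat, lon)."""
    return (round((floor(lat / cell) + 0.5) * cell, 3),
            round((floor(lon / cell) + 0.5) * cell, 3))

# a few sample (latitude, longitude) pickup points
pickups = [
    (40.734695, -73.990372),
    (40.729912, -73.980782),
    (40.732407, -73.981842),
    (40.738100, -73.982900),
    (40.758514, -73.977264),
]

# trips per grid cell, busiest first
counts = Counter(cell_centre(lat, lon) for lat, lon in pickups)
busiest = counts.most_common(10)

print(busiest)
```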

In [53]:
 
# Map the pick up locations
import pandas as pd
import matplotlib  
import matplotlib.pyplot as plt 
from matplotlib import rcParams  
df=df_big
#pd.options.display.mpl_style = 'default' #Better Styling 
matplotlib.pyplot.style.use('ggplot')
new_style = {'grid': False} #Grid off  
matplotlib.rc('axes', **new_style)  
rcParams['figure.figsize'] = (12, 12) #Size of figure  
rcParams['figure.dpi'] = 250
P=df.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude',color='white',xlim=(-74.06,-73.77),ylim=(40.61, 40.91),s=.02,alpha=.3)
#P.set_axis_bgcolor('black') #Background Color
P.set_facecolor('black') #Background Colour
#plt.show()
In [55]:
 
# Map the drop off locations
df=df_big
import matplotlib  
import matplotlib.pyplot as plt 
from matplotlib import rcParams 
##Inline Plotting for jupyter Notebook 
#%matplotlib inline 
#pd.options.display.mpl_style = 'default' #Better Styling  
matplotlib.pyplot.style.use('ggplot')
new_style = {'grid': False} #Grid off  
matplotlib.rc('axes', **new_style)  
 
rcParams['figure.figsize'] = (12, 12) #Size of figure  
rcParams['figure.dpi'] = 250
P=df.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude',color='white',xlim=(-74.06,-73.77),ylim=(40.61, 40.91),s=.02,alpha=.3)  #s is size and alpha is opaque-ness 
P.set_facecolor('black') #Background Colour
plt.show()
In [84]:
 
#Top 10 busiest locations of the city
#import reverse_geocoder as rg
from geopy.geocoders import Nominatim
df=df_big
#round the lat and long entries 
#Latitude_round=df.loc[df['payment_type'] == 1, 'fare_amount'].values
Latitude_round=np.round(df['pickup_latitude'].values, decimals=2)+0.005   #round and recentre grid box
Longitude_round=np.round(df['pickup_longitude'].values, decimals=2)+0.005 #round and recentre grid box
#print(Latitude_round[0:5])
#print(Longitude_round[0:5])
df.loc[:,'GridcodeLat'] = pd.Series(Latitude_round, index=df.index) #add column gridcodes to df
df.loc[:,'GridcodeLon'] = pd.Series(Longitude_round, index=df.index) #add column gridcodes to df
#find 10 locations with most common grid codes
mytable = df.groupby(['GridcodeLat','GridcodeLon']).size()
mytable.sort_values(inplace=True,ascending=False)
totaltrips=mytable.sum()
print('Total trips')
print(totaltrips)
Top10BusyPickupLocations=mytable.head(30)  # note: keeps the 30 busiest grid cells, despite the variable name
#print(Top10BusyPickupLocations)
#print(type(Top10BusyPickupLocations))
Top10BusyPickupLocations=Top10BusyPickupLocations.to_frame()
print(Top10BusyPickupLocations)
print(type(Top10BusyPickupLocations))
#coordinates = (51.5214588,-0.1729636),(9.936033, 76.259952),(37.38605,-122.08385)
coordinates = Top10BusyPickupLocations.index.values.tolist()
print(coordinates)
type(coordinates)
#results = rg.search(coordinates) # default mode = 2, reverse geocode from lat and long to address
#print(results)
geolocator = Nominatim()
#locations = geolocator.reverse("40.755,     -73.985")
for i in range(0,30):
    try:
        location = geolocator.reverse(coordinates[i])
        #print(location)
    except:
        PlaceNames='Unknown, Unknown, Unknown, Unknown, Unknown'
    PlaceNames=location.address.split(",")
    print([PlaceNames[-8],PlaceNames[-7],PlaceNames[-6]] )
    
#df1.loc[:,'f'] = p.Series(np.random.randn(sLength), index=df1.index) #add column f to df1
#plot table or pie chart
Total trips
2389990
                              0
GridcodeLat GridcodeLon        
40.755      -73.985      147875
40.765      -73.965      136667
            -73.975      119171
            -73.985      114601
40.755      -73.975      111538
40.745      -73.985       98970
40.735      -73.985       90326
40.775      -73.955       82995
40.735      -73.995       76986
40.775      -73.975       73077
40.745      -73.995       70636
            -73.975       69773
40.785      -73.945       63615
40.725      -73.985       62791
40.785      -73.975       60271
            -73.955       58241
40.755      -73.965       47507
40.725      -73.995       46156
0.005        0.005        46056
40.775      -73.945       42486
40.735      -73.975       42232
40.765      -73.955       41652
40.795      -73.965       39834
40.755      -73.995       38234
40.715      -74.005       38047
40.745      -74.005       37152
40.775      -73.865       32844
40.725      -74.005       29397
40.775      -73.985       29167
40.765      -73.995       25774
<class 'pandas.core.frame.DataFrame'>
[(40.755, -73.985), (40.765, -73.965), (40.765, -73.97500000000001), (40.765, -73.985), (40.755, -73.97500000000001), (40.745000000000005, -73.985), (40.735, -73.985), (40.775000000000006, -73.955), (40.735, -73.995), (40.775000000000006, -73.97500000000001), (40.745000000000005, -73.995), (40.745000000000005, -73.97500000000001), (40.785000000000004, -73.94500000000001), (40.725, -73.985), (40.785000000000004, -73.97500000000001), (40.785000000000004, -73.955), (40.755, -73.965), (40.725, -73.995), (0.005, 0.005), (40.775000000000006, -73.94500000000001), (40.735, -73.97500000000001), (40.765, -73.955), (40.795, -73.965), (40.755, -73.995), (40.715, -74.00500000000001), (40.745000000000005, -74.00500000000001), (40.775000000000006, -73.86500000000001), (40.725, -74.00500000000001), (40.775000000000006, -73.985), (40.765, -73.995)]
[' Diamond District', ' Manhattan', ' Manhattan Community Board 5']
[' Lenox Hill', ' Manhattan', ' Manhattan Community Board 8']
[' Central Park South', ' Diamond District', ' Manhattan']
[" Hell's Kitchen", ' Manhattan', ' Manhattan Community Board 4']
[' Diamond District', ' Manhattan', ' Manhattan Community Board 5']
[' Rose Hill', ' Manhattan', ' Manhattan Community Board 5']
[' Flatiron', ' Manhattan', ' Manhattan Community Board 6']
[' Yorkville', ' Manhattan', ' Manhattan Community Board 8']
[' Washington Square Village', ' Manhattan', ' Manhattan Community Board 2']
[' Strawberry Fields', ' Central Park', ' Manhattan']
[' Chelsea', ' Manhattan', ' Manhattan Community Board 4']
[' Murray Hill', ' Manhattan', ' Manhattan Community Board 6']
[' Yorkville', ' Manhattan', ' Manhattan Community Board 11']
[' Alphabet City', ' Manhattan', ' Manhattan Community Board 3']
[' Upper West Side', ' Manhattan', ' Manhattan Community Board 7']
[' Yorkville', ' Manhattan', ' Manhattan Community Board 8']
[' Tudor City', ' Manhattan', ' Manhattan Community Board 6']
[' Five Points', ' Manhattan', ' Manhattan Community Board 2']
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-84-5bd6548af319> in <module>()
     44     except:
     45         PlaceNames='Unknown, Unknown, Unknown, Unknown, Unknown'
---> 46     PlaceNames=location.address.split(",")
     47     print([PlaceNames[-8],PlaceNames[-7],PlaceNames[-6]] )
     48 

AttributeError: 'NoneType' object has no attribute 'split'
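
The loop stops at the 19th grid cell, `(0.005, 0.005)`, which collects trips whose recorded coordinates are approximately zero. The `AttributeError` indicates that the reverse lookup returned no address for that cell. The `except` branch does not help here because `PlaceNames` is overwritten on the next line regardless. A minimal sketch of a more defensive version follows; the helper names and the sample address string are hypothetical, and the bounds are simply the plot window used for the maps in this notebook.

```python
def place_fields(address, start=-8, stop=-5):
    """Return the three address components the notebook prints,
    or 'Unknown' placeholders when the address is missing or too short."""
    unknown = ["Unknown"] * (stop - start)
    if not address:                      # None or empty string from the geocoder
        return unknown
    parts = [p.strip() for p in address.split(",")]
    if len(parts) < -start:              # not enough components to index from the end
        return unknown
    return parts[start:stop]

def in_map_window(lat, lon):
    """True if the point lies inside the longitude/latitude window used for the maps."""
    return 40.61 <= lat <= 40.91 and -74.06 <= lon <= -73.77

# hypothetical address string in the comma-separated style of a reverse-geocoding result
sample = ("20, West 47th Street, Diamond District, Manhattan, "
          "Manhattan Community Board 5, New York County, New York City, "
          "New York, 10036, United States of America")

print(place_fields(sample))
print(place_fields(None))
print(in_map_window(0.005, 0.005))
```

In the loop, cells that fail `in_map_window` could be skipped before calling the geocoder, and `place_fields(location.address if location else None)` could replace the direct `split`.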

In [52]:
 
#Top10BusyPickupLocations['GridcodeLat','GridcodeLon'].values
Top10BusyPickupLocations.index.values
coordinates[2]
Out[52]:
(40.765, -73.97500000000001)
In [ ]:
 
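#plot pie chart of Top 10 busiest locations
# NOTE: unfinished template; `figure` is not defined in this notebook and the values below are placeholders, not taxi data
# Add graph data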
trace1={'labels': ['1st', '2nd', '3rd', '4th', '5th'],
        'values': [38, 27, 18, 10, 7],
        'type': 'pie',
        'name': 'Starry Night',
        'marker': {'colors': ['rgb(56, 75, 126)',
                              'rgb(18, 36, 37)',
                              'rgb(34, 53, 101)',
                              'rgb(36, 55, 57)',
                              'rgb(6, 4, 4)']},
            'domain': {'x': [0, 1],
                       'y': [.4, 1]},
            'hoverinfo':'label+percent+name',
            'textinfo':'none'
        }
# Add trace data to figure
figure['data'].extend(go.Data([trace1]))
# Edit layout for subplots
figure.layout.yaxis.update({'domain': [0, .30]})
# The graph's yaxis2 MUST BE anchored to the graph's xaxis2 and vice versa
# Update the margins to add a title and see graph x-labels. 
figure.layout.margin.update({'t':75, 'l':50})
figure.layout.update({'title': 'Starry Night'})
# Update the height because adding a graph vertically will interact with
# the plot height calculated for the table
figure.layout.update({'height':800})
# Plot!
py.iplot(figure)
In [ ]:
 
#classify into Manhattan, JFK airport, LaGuardia
#Q: what percentage are airport trips?
#map the fare disputes / scrap, as there are not many of these
#find out % of trips paid by cc versus cash
#insights: lots of drop-offs in Brooklyn, Queens, Bronx; fewer pick-ups from these areas. Do people take taxis home rather than to work?
#time of day? weekend? And people seem to get picked up from main streets! (the Sex and the City iconography of hailing a cab is true!)
#interesting in times of Uber
In [90]:
 
#plot Distribution: Passenger numbers per trip
import numpy as np
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.figure_factory as ff
import plotly.graph_objs as go
#peps_per_triprav = peps_per_trip.ravel() 
print(peps_per_trip)
#the plt plot below works, but the plotly distplot fails
#print(df.shape)
#df = df.replace('[]', np.nan)#Soln (a) replace all elements that have any empty value with NaN values
#df=df.dropna()     #Soln (b) drop all rows that have any NaN values
#print(df.shape)
peps_per_trip=df.loc[:, df.columns.str.match('passenger_count')].values
hist_data = [peps_per_trip]
group_labels = ['distplot']
#plt.plot(peps_per_trip)
#plt.show()
fig = ff.create_distplot(hist_data, group_labels)
fig['layout'].update(title='Distribution: Passenger numbers per trip')
py.iplot(fig, filename='DistplotPepsPerTrip')
[[2]
 [5]
 [1]
 ..., 
 [1]
 [1]
 [1]]
C:\Users\elmaog\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py:2487: RuntimeWarning:

Degrees of freedom <= 0 for slice

C:\Users\elmaog\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py:2496: RuntimeWarning:

divide by zero encountered in double_scalars

C:\Users\elmaog\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py:2496: RuntimeWarning:

invalid value encountered in multiply

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-90-127596d598e4> in <module>()
     21 #plt.show()
     22 
---> 23 fig = ff.create_distplot(hist_data, group_labels)
     24 
     25 fig['layout'].update(title='Distribution: Passenger numbers per trip')

C:\Users\elmaog\AppData\Local\Continuum\Anaconda3\lib\site-packages\plotly\figure_factory\_distplot.py in create_distplot(hist_data, group_labels, bin_size, curve_type, colors, rug_text, histnorm, show_hist, show_curve, show_rug)
    190             hist_data, histnorm, group_labels, bin_size,
    191             curve_type, colors, rug_text,
--> 192             show_hist, show_curve).make_kde()
    193 
    194     rug = _Distplot(

C:\Users\elmaog\AppData\Local\Continuum\Anaconda3\lib\site-packages\plotly\figure_factory\_distplot.py in make_kde(self)
    309                                    / 500 for x in range(500)]
    310             self.curve_y[index] = (scipy_stats.gaussian_kde
--> 311                                    (self.hist_data[index])
    312                                    (self.curve_x[index]))
    313 

C:\Users\elmaog\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\stats\kde.py in __init__(self, dataset, bw_method)
    169 
    170         self.d, self.n = self.dataset.shape
--> 171         self.set_bandwidth(bw_method=bw_method)
    172 
    173     def evaluate(self, points):

C:\Users\elmaog\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\stats\kde.py in set_bandwidth(self, bw_method)
    496             raise ValueError(msg)
    497 
--> 498         self._compute_covariance()
    499 
    500     def _compute_covariance(self):

C:\Users\elmaog\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\stats\kde.py in _compute_covariance(self)
    507             self._data_covariance = atleast_2d(np.cov(self.dataset, rowvar=1,
    508                                                bias=False))
--> 509             self._data_inv_cov = linalg.inv(self._data_covariance)
    510 
    511         self.covariance = self._data_covariance * self.factor**2

C:\Users\elmaog\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\linalg\basic.py in inv(a, overwrite_a, check_finite)
    656 
    657     """
--> 658     a1 = _asarray_validated(a, check_finite=check_finite)
    659     if len(a1.shape) != 2 or a1.shape[0] != a1.shape[1]:
    660         raise ValueError('expected square matrix')

C:\Users\elmaog\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\_lib\_util.py in _asarray_validated(a, check_finite, sparse_ok, objects_ok, mask_ok, as_inexact)
    226             raise ValueError('masked arrays are not supported')
    227     toarray = np.asarray_chkfinite if check_finite else np.asarray
--> 228     a = toarray(a)
    229     if not objects_ok:
    230         if a.dtype is np.dtype('O'):

C:\Users\elmaog\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\lib\function_base.py in asarray_chkfinite(a, dtype, order)
   1031     if a.dtype.char in typecodes['AllFloat'] and not np.isfinite(a).all():
   1032         raise ValueError(
-> 1033             "array must not contain infs or NaNs")
   1034     return a
   1035 

ValueError: array must not contain infs or NaNs
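
A likely cause of this `ValueError` is the shape of `peps_per_trip`. Selecting columns with a boolean mask and calling `.values` gives an array of shape `(n, 1)`, as the printed output shows. `scipy.stats.gaussian_kde` expects univariate data as a 1-D array and reads a 2-D array as `(dimensions, samples)`, so an `(n, 1)` array looks like `n` variables with one observation each. The covariance is then undefined, which matches the "Degrees of freedom <= 0" warning. The sketch below reproduces the covariance step with NumPy on toy data, mirroring the `np.cov(..., rowvar=1, bias=False)` call in the traceback; the toy values are assumptions.

```python
import warnings
import numpy as np

# toy stand-in for the (n, 1) array returned by the column selection
column = np.array([[2], [5], [1], [1], [1]], dtype=float)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")          # silence the degrees-of-freedom warning
    with np.errstate(all="ignore"):
        # rows are treated as variables: 5 variables, 1 observation each
        cov_column = np.cov(column, rowvar=True, bias=False)

# flatten to shape (n,): one variable with n observations
flat = column.ravel()
cov_flat = np.cov(flat, rowvar=True, bias=False)

print(column.shape, np.isnan(cov_column).all())
print(flat.shape, float(cov_flat))
```

Passing `df['passenger_count'].values` (already 1-D) or `peps_per_trip.ravel()` into `hist_data` should therefore avoid the NaN covariance, though a smooth KDE curve over a handful of integer passenger counts may still be of limited use.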

In [65]:
 
#import plotly
#plotly.tools.set_credentials_file(username='eosg', api_key='AmlsmkQM0FkVbEPtlQSf')
#plotly.tools.set_credentials_file(username='elmao', api_key='8z69RhuTfVA7EdkIEtXZ')
 
## If you run a taxi company, how would you maximize your earnings?
Uber is a major market disrupter in the taxi space. To maximise taxi company earnings, concurrent analysis of Uber versus taxi data is necessary.
Thoughts: On cold NY winter mornings (or in the rain!), does Uber now take a big share of the historical taxi market (direct pick-up from the door rather than walking to a major route to hail a taxi)?
* UberT has entered the market gap here (you can request a yellow taxi to your door through the Uber app)

In [6]:
 
#basic Histograms
#extract number of people per trip
peps_per_trip_df=df.loc[:, df.columns.str.match('passenger_count')]
peps_per_trip_df.shape
print(type(peps_per_trip_df))
peps_per_trip=df.loc[:, df.columns.str.match('passenger_count')].values
print(type(peps_per_trip))
fare_paymenttype1=df.loc[df['payment_type'] == 1, 'fare_amount'].values
fare_paymenttype2=df.loc[df['payment_type'] == 2, 'fare_amount'].values
fare_paymenttype4=df.loc[df['payment_type'] == 4, 'fare_amount'].values
type(fare_paymenttype1)
#1=cc, 2=cash, 3=no charge, 4=dispute, 5=unknown, 6=voided trip
#rate code ID (final rate code at end of the trip): 1=standard rate, 2=JFK, 3=Newark, 4=Nassau or Westchester, 5=Negotiated fare, 6=Group ride
<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
Out[6]:
numpy.ndarray
In [ ]:
 